Skip to content

[https://nvbugs/6179661][fix] Fix disagg generation-side KV transfer timeout deadlocks and teardown crashes#15363

Closed
nv-xtf wants to merge 1 commit into
NVIDIA:mainfrom
nv-xtf:dev-tingfengx-disagg-kv-deadlock-fix
Closed

[https://nvbugs/6179661][fix] Fix disagg generation-side KV transfer timeout deadlocks and teardown crashes#15363
nv-xtf wants to merge 1 commit into
NVIDIA:mainfrom
nv-xtf:dev-tingfengx-disagg-kv-deadlock-fix

Conversation

@nv-xtf

@nv-xtf nv-xtf commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Bug Fixes

    • Improved shutdown and teardown reliability for distributed cache management operations.
    • Enhanced timeout detection and error handling for generation transfers in distributed settings.
    • Added safeguards against invalid iterator access during transfer response processing.
  • Chores

    • Updated environment configuration for distributed communication setup.
    • Refactored transfer status polling with improved multi-rank synchronization and budget management.

Description

In disaggregated serving, when a generation-side KV cache transfer does not complete in time (e.g. under heavy transport load), the PyTorch executor/cache-transceiver path could deadlock or crash the engine. This PR fixes the following issues:

  1. Unbounded blocking in checkGenTransferStatus (C++). To satisfy atLeastRequestNum, the generation-side status check force-selected not-yet-ready requests and then blocked in future::get() with no bound. A transfer that never completes wedged that rank inside get() while sibling ranks advanced to the next collective, deadlocking the whole sync group. It is now a bounded poll: only futures ready on every rank of the sync group are completed, the wait for atLeastRequestNum is capped, and incomplete transfers are left in flight for later checks. All loop-exit decisions are derived from rank-identical data so every rank runs the same number of collectives.

  2. Generation-side transfer timeouts were not acted upon consistently (Python). The timeout flag was set but its only consumer was unreachable behind the blocking call above, and it was rank-local. _cancel_timed_out_gen_transfers now runs every iteration on every rank and reaches a TP-wide consensus before changing request state: an OR-gate allreduce to skip when nothing timed out, an allgather union of the timed-out ids (so wall-clock skew between ranks does not diverge the set), and a second allgather on the per-rank cancel result so a request is only failed once every rank tracking it has cancelled its share.

  3. Collective gated on rank-local error state (Python). _check_cache_transfer_errors only entered _handle_errors (→ _enqueue_responses → tp_gather) when the local rank had errored requests. Under attention DP a cancel succeeds on a subset of ranks, so some ranks entered the gather while others skipped it, deadlocking the TP group. The generation-side path now flushes errors on every rank unconditionally.

  4. Use-after-free on teardown (C++). ~CacheTransceiver relied on member destruction order, but ConnectionManager is declared after — and thus destroyed before — CacheSender/CacheReceiver, whose background threads still dereferenced it, causing a segfault during shutdown. The destructor now resets the sender and receiver first, while the manager is still alive.

  5. Two shutdown-path crashes in CacheSender::response() (C++). (a) recvRequestInfo returning a null connection was only handled when the manager had already stopped; otherwise an empty RequestInfo flowed into processing and threw bad_optional_access. (b) Terminating while waiting in the inner loop left the iterator at end(), which was then dereferenced in sendResponse. Both now bail out cleanly when terminating.

A bounded-poll metric (waited calls / total poll iterations / budget-exhaustion count) is logged on budget exhaustion to keep the throughput impact of the non-blocking rewrite observable.

Test Coverage

Validated with the disaggregated perf-sanity case perf/test_perf_sanity.py::test_e2e[disagg_upload-e2e-gb200_kimi-k25-thinking-fp4_8k1k_con4096_ctx1_dep4_gen1_dep16_eplb0_mtp0_ccb-NIXL] on GB200 (gen TP16 / attention-DP, ctx TP4, NIXL)

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@nv-xtf nv-xtf requested review from a team as code owners June 15, 2026 05:56
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

Replaces unbounded, timeout-driven generation KV transfer completion in CacheTransceiver::checkGenTransferStatus with a bounded polling loop using a new allgatherIntValue helper for rank-consistent completion. Adds a centralized _cancel_timed_out_gen_transfers helper in Python, removes the old inline timeout branch from _handle_responses, fixes teardown ordering and shutdown-path guards in sender/receiver threads, and unsets UCX_NET_DEVICES alongside UCX_TLS in disaggregated launch scripts.

Changes

Disaggregated KV Transfer Timeout & Polling Rework

Layer / File(s) Summary
New polling metrics and cancellation state
cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h, tensorrt_llm/_torch/pyexecutor/llm_request.py
Removes mTimedOutRequesterIds dedup set, adds three mGenPoll* counters on CacheTransceiver, and initializes py_kv_transfer_cancelled = False on LlmRequest.
Teardown ordering and shutdown safety guards
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp, cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
Reorders destructor to reset sender/receiver before closing the UCX handle; adds null-connection early-return in recvRequestInfo; adds terminate-guard break in the sender response wait loop.
allgatherIntValue helper and bounded polling in checkGenTransferStatus
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
Adds <chrono>/<cstddef>/<numeric>/<thread> includes, introduces allgatherIntValue for single-int rank coordination, and replaces checkGenTransferStatus with a bounded poll loop that uses allgather to align completion targets and conditionally updates KV transfer metrics.
Centralized Python gen-transfer timeout cancellation
tensorrt_llm/_torch/pyexecutor/py_executor.py
Adds _cancel_timed_out_gen_transfers called from _check_disagg_gen_transfer_status; builds a per-rank timed-out set, aligns across TP ranks via collectives, cancels transfers, marks DISAGG_TRANS_ERROR, and calls _handle_errors(charge_budget=False). Removes the old inline py_kv_transfer_timed_out branch from _handle_responses.

UCX Environment Cleanup Fix

Layer / File(s) Summary
Unset UCX_NET_DEVICES alongside UCX_TLS
jenkins/scripts/perf/local/submit.py
Changes ucx_tls_cmd in the non-B200 disaggregated launch path to unset both UCX_TLS and UCX_NET_DEVICES.

Sequence Diagram(s)

sequenceDiagram
  participant Exec as PyExecutor
  participant Cancel as _cancel_timed_out_gen_transfers
  participant TP as TP Collective
  participant Transceiver as kv_cache_transceiver (C++)
  participant Poll as checkGenTransferStatus

  rect rgba(100, 149, 237, 0.5)
    note over Exec,Poll: Per-iteration disagg gen transfer check
    Exec->>Poll: checkGenTransferStatus()
    Poll->>Poll: allgatherIntValue (rank-align ready count)
    loop up to poll budget
      Poll->>Poll: check futures (wait_for 0ms)
      Poll->>Poll: sleep if not ready
    end
    Poll-->>Exec: completed requests erased from mRequesterFutures
  end

  rect rgba(255, 140, 0, 0.5)
    note over Exec,Cancel: Centralized timeout cancellation (new)
    Exec->>Cancel: _cancel_timed_out_gen_transfers()
    Cancel->>TP: allreduce timed-out request IDs
    TP-->>Cancel: aligned set across TP ranks
    Cancel->>Transceiver: cancel_request(request)
    Cancel->>Cancel: mark DISAGG_TRANS_ERROR
    Cancel->>Exec: _handle_errors(charge_budget=False)
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#15139: Directly related — implements consensus-based deferral of generation transfer terminal-state transitions that this PR's bounded polling and allgather coordination builds upon.

Suggested reviewers

  • Shixiaowei02
  • pcastonguay
  • joyang-nv
  • Tabrizian
  • bo-nv
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 6.67% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main fix: addressing KV transfer timeout deadlocks and teardown crashes in disaggregated serving, with a clear reference to the NVBugs ticket.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description check ✅ Passed The PR description comprehensively explains all five issues being fixed, provides clear rationales for each change, lists relevant test coverage, and includes a completed PR checklist.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp (1)

707-708: ⚡ Quick win

Use TensorRT-LLM constant naming for the new poll constants.

The new constants should use camelCase with a k prefix, e.g. kMaxPollIterations and kPollInterval. As per coding guidelines, “Constants should follow naming convention: camelCase with prefix 'k'.”

♻️ Proposed rename
-    static constexpr int kMAX_POLL_ITERATIONS = 32;
-    static constexpr auto kPOLL_INTERVAL = std::chrono::milliseconds(2);
+    static constexpr int kMaxPollIterations = 32;
+    static constexpr auto kPollInterval = std::chrono::milliseconds(2);

Also update the later references in this function accordingly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp` around lines 707 - 708,
The constants kMAX_POLL_ITERATIONS and kPOLL_INTERVAL do not follow the
TensorRT-LLM naming convention. Rename kMAX_POLL_ITERATIONS to
kMaxPollIterations and kPOLL_INTERVAL to kPollInterval to use camelCase with the
'k' prefix as per the coding guidelines. Then update all references to these
constants throughout the function to use the new names.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 714-718: The checkGenTransferStatus() method in
cacheTransceiver.cpp with the default atLeastRequestNum parameter now only polls
for a limited time before returning, while transfers may still be in flight.
This breaks the destructor's assumption in trtGptModelInflightBatching.cpp that
checkGenTransferComplete() will be true after calling checkGenTransferStatus().
Either update the default behavior of checkGenTransferStatus() to fully complete
all transfers when atLeastRequestNum is nullopt (restoring backward
compatibility), or ensure all callers explicitly loop and reinvoke
checkGenTransferStatus() until checkGenTransferComplete() returns true instead
of relying on a single call with the default. Additionally, rename the constants
kMAX_POLL_ITERATIONS and kPOLL_INTERVAL at lines 707–708 to use camelCase naming
convention (kMaxPollIterations and kPollInterval) to match coding guidelines.

In `@tensorrt_llm/_torch/pyexecutor/py_executor.py`:
- Around line 4117-4120: The `_check_cache_transfer_errors()` method currently
calls `_handle_errors()` only when the local rank has errors, which can cause
rank inconsistency when transfer errors appear on a subset of ADP ranks. Both
the location at lines 4117-4120 and the similar location at lines 4442-4447 need
to be modified to ensure error handling occurs after TP consensus across all
ranks, consistent with how the timeout path handles errors. Restructure the
error handling in both locations so that `_handle_errors()` is called after TP
consensus has completed, rather than conditionally only for ranks with local
errors, ensuring all ranks remain synchronized when entering
`_enqueue_responses()` TP gather.
- Line 4046: The method _cancel_timed_out_gen_transfers() at line 4046 depends
on timeout flags being set by _check_kv_transfer_timeout(), but this check is
not being called unconditionally before the cancellation method. In pipeline
parallel generation-only transfer stalls, the py_kv_transfer_timed_out flag may
never be set, causing the cancellation logic to remain idle. Add a call to
_check_kv_transfer_timeout() immediately before the
_cancel_timed_out_gen_transfers() call to ensure timeout flags are refreshed and
the cancellation path can execute properly.
- Line 4051: Add a return type annotation of `-> None` to the method signature
of `_cancel_timed_out_gen_transfers` at line 4051, since the method has multiple
early returns and does not return a value. Additionally, locate the `zip()` call
at line 4106 that combines `global_ok` and `global_timed_out` (which are
explicitly constructed to have matching lengths) and add the `strict=True`
parameter to it to satisfy the Ruff B905 check and enforce that both iterables
have the same length during iteration.

---

Nitpick comments:
In `@cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp`:
- Around line 707-708: The constants kMAX_POLL_ITERATIONS and kPOLL_INTERVAL do
not follow the TensorRT-LLM naming convention. Rename kMAX_POLL_ITERATIONS to
kMaxPollIterations and kPOLL_INTERVAL to kPollInterval to use camelCase with the
'k' prefix as per the coding guidelines. Then update all references to these
constants throughout the function to use the new names.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 5d852be3-a13a-4c59-9480-0e4a898c264b

📥 Commits

Reviewing files that changed from the base of the PR and between b1ee4ab and c94ad5a.

📒 Files selected for processing (6)
  • cpp/include/tensorrt_llm/batch_manager/cacheTransceiver.h
  • cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp
  • cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp
  • jenkins/scripts/perf/local/submit.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/py_executor.py

Comment thread cpp/tensorrt_llm/batch_manager/cacheTransceiver.cpp Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py Outdated
Comment thread tensorrt_llm/_torch/pyexecutor/py_executor.py
… deadlocks and teardown crashes

Signed-off-by: Tingfeng Xian <289617005+nv-xtf@users.noreply.github.com>
@nv-xtf nv-xtf force-pushed the dev-tingfengx-disagg-kv-deadlock-fix branch from c94ad5a to 68ff9d8 Compare June 15, 2026 06:25
@nv-xtf

nv-xtf commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54449 [ run ] triggered by Bot. Commit: 68ff9d8 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54449 [ run ] completed with state FAILURE. Commit: 68ff9d8
/LLM/main/L0_MergeRequest_PR pipeline #43512 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@nv-xtf

nv-xtf commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

/bot rerun

@github-actions

Copy link
Copy Markdown

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental) --high-priority]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Supports wildcard * for pattern matching (e.g., "*PerfSanity*" matches all stages containing PerfSanity). Examples: "A10-PyTorch-1, xxx", "PerfSanity". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Supports wildcard * for pattern matching. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx", --extra-stage "Post-Merge".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

--high-priority (OPTIONAL) : Run the pipeline with high priority. This option is restricted to authorized users only and will route the job to a high-priority queue.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@nv-xtf

nv-xtf commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast --reuse-test

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54476 [ run ] triggered by Bot. Commit: 68ff9d8 Link to invocation

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #54476 [ run ] completed with state SUCCESS. Commit: 68ff9d8
/LLM/main/L0_MergeRequest_PR pipeline #43539 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@chienchunhung chienchunhung left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up from the related #15356 thread: thanks for putting this together, and for calling out the ADP timeout/cancellation issue clearly there.

After reviewing this PR against the current cancellation PR chain, I plan to split the work across multiple PRs for better scoping and deduced risks:

  • #15422 carries the teardown/lifetime hardening from this PR. That keeps shutdown safety independent and lets it land before #15238 if possible.
  • #15238 should own the generation-side timeout/cancellation semantics, including rank-consistent agreement on the timeout/cancellable request set before _handle_errors() / _enqueue_responses() can enter tp_gather.
  • #15356 -> #15238 remains the bounded-polling / request-cancellation chain. #15356 itself should stay focused on V2/Python transceiver bounded context polling, not cancellation semantics.

tl;dr: the ADP tp_gather participant mismatch is a real gap, intentionally not addressed in #15356, and should be handled in #15238 through rank-consistent cancellation. The teardown hardening is independent from the dependency graph and is split into #15422. I’ll port/reimplement the relevant cancellation-consensus logic from this PR into #15238.

@nv-xtf nv-xtf closed this Jun 23, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants